Executive Summary 

Test takers tend to differ from one another in the amount of time required to respond to items. This is 
true even among test takers of the same ability level. Although this finding is not surprising, it may lead to a 
serious scoring problem. If some test takers do not complete all test items, the test-scoring procedure must 
include a provision for unreached items. Such items could be treated as incorrect (e.g., a test taker's final 
score could be influenced by the number of unreached items) or unreached items could be ignored (i.e., 
treated as missing). This decision should be made according to beliefs about the independence and relative 
importance of response speed and response accuracy in the context of the test. 

If speed and accuracy are independent and the test is designed to measure accuracy, test taker ability 
should be based on accuracy alone, and test takers should not be penalized for unreached items. If speed and 
accuracy are related or if both are important in the test context, response speed may be included in the 
scoring rubric, and unreached items would count against a test taker. In the latter case, estimates of test taker 
ability would reflect both response speed and response accuracy. Realistic scoring models that combine 
measures of speed and accuracy are not yet available, but the scant empirical research concerning the 
relationship between response speed and response accuracy in large-scale testing suggests that speed and 
accuracy are independent factors in power tests (i.e., tests that measure accuracy alone). 

The best solution to the problem of unreached items may be to design the test in such a way that they do 
not occur or are minimized. This could be accomplished with very generous time limits (a costly solution). 
Computer adaptive testing (CAT), however, offers an attractive alternative. Test taker speed can be assessed 
along with test taker ability (measured in terms of response accuracy), and the estimated test taker speed can 
be included in the item-selection algorithm. Thus, items are selected for a test taker that are appropriate for 
the test taker's ability, but are unlikely to be so time-consuming that the test taker fails to complete all test 
items. This solution requires a model of response speed and an item-selection algorithm that accommodates 
response-speed constraints. Both aspects are addressed by the current work. 

A model of response speed is used as the basis for predicting a test taker's response time for each item in 
the item pool. Items are selected according to an algorithm for constrained CAT. The item-selection 
algorithm constrains item selection so that the test taker is likely to have sufficient time to answer all items 
while simultaneously ensuring that test specifications are met and all test takers receive items that are 
tailored to their ability level. Response-time predictions are modified according to the time taken by the test 
taker to respond to items already administered. Analyses of operational data from a large-scale standardized 
test support the use of the response-speed model, and simulations of the item-selection algorithm demonstrate 
that response-time constraints could be included in item selection while maintaining test quality. 

The present approach to adaptive item selection is a solution to the scoring problems introduced by 
differences in response speed across test takers. This solution may be preferable to the obvious alternatives 
of reduced test length (with a reduction in measurement precision) or increased time limits (with added 
administration costs). The preliminary results reported here demonstrate the reasonableness of the 
response-speed model and the feasibility of including response-time constraints in item selection. 

Abstract 

Test takers with the same ability differ in the amount of time they need to complete a test item. 
Therefore, some test takers may be affected unfavorably by the presence of a time limit on a test. This paper 
proposes an item-selection algorithm that can be used to neutralize the effect of time limits in computer 
adaptive testing. The method is based on a statistical model for the response-time distributions of the test 
takers on the items in the pool that is updated each time a new item has been administered. Predictions from 
the model are used as constraints in a 0-1 LP model for constrained adaptive testing that maximizes the 
accuracy of the ability estimator. The method is demonstrated empirically using an item pool from an 
operational, large-scale computer adaptive test. 



Introduction 

Test takers generally need different amounts of time to complete the same item in an educational or 
psychological test even if they have the same ability. As a consequence, some test takers may finish the test 
in time whereas others do not reach the items at the end of the test. However, unreached items involve a 
serious scoring problem. First of all, in a conventional paper-and-pencil test it is impossible to discriminate 
between unreached items and reached items that were left unanswered because their answers were 
unknown. But even if it were exactly known which items were not reached (as is possible in computerized 
testing), scoring would remain a complicated issue. If the test is designed such that speed and ability are 
independent factors, responses to items not reached can be viewed as missing at random and should be 
ignored when the test is scored. For the notion of data that are missing at random, see Gelman, Carlin, Stern, 
and Rubin (1995, sect. 7.4). However, if speed and ability are dependent, the correct way to score the test is 
under a model that represents their joint effect on the probabilities of success on the items. Realistic versions 
of such models are not available yet. 

Empirical studies of the relation between response time and ability have become possible through the 
introduction of computerized testing but are still hard to find. A favorable exception is a recent study of the 
response times in a field test of a computerized version of the National Board of Medical Examiners (NBME) 
Step 2 Licensure Exam (Swanson, Featherman, Case, Luecht, & Nungester, 1997, April). In this study, items 
in linear subtests were timed, and no correlation between response time and ability was found for the 
various subtests. However, a replication of the study for the NBME Step 1 Licensure Exam showed moderate 
correlation toward the end of the subtests apparently due to the use of more stringent time limits (Swanson, 
personal communication, December 18, 1997). These results seem to suggest that the only effective way to 
deal with scoring problems due to unreached items is to design the test such that they do not occur. 

However, given the variability in response time among test takers, for a conventional linear test this 
approach would imply either the need for a shorter test with loss of accuracy in ability estimation for the 
faster test takers or a more generous time limit for the test and thus an increase in costs. 

In computer adaptive testing (CAT), an attractive solution to this test design dilemma is possible. If a 
model for the response-time distributions of the test takers on the items in the pool is available, the actual 
response times on the items administered to the current test taker can be recorded and used to update the 
estimates of these distributions for the remaining items in the pool. These distributions can then be used to 
constrain the selection of the next items in the test to give the test the same degree of speededness for all test 
takers. Of course, this procedure is only feasible if a model for the response-time distributions with a 
satisfactory fit to actual response-time data is available. 

It is the purpose of this paper to present an algorithm for adaptive testing that builds on this idea. In 
addition to the usual update of the estimate of the ability parameter in an IRT model, a lognormal model for 
the response-time distributions is used to update the response-time estimates for the items in the pool. 
Response-time constraints are derived from these estimates and imposed on the item selection using a 0-1 
linear programming (LP) algorithm for constrained adaptive testing. The following sections of this paper 
introduce the model and the algorithm. The algorithm is then studied empirically using an item pool and 
estimates of response-time parameters for a large-scale, operational adaptive test. The main purpose of the 
study was to ascertain the effects of the response-time constraints on the statistical properties of the ability 
estimator. The final section of the paper discusses some remaining aspects of the implementation of the 
algorithm in the practice of educational and psychological testing. 

Model for Response Times 

The response time of test taker j on item i is denoted by a variable $T_{ij}$. The variable is assumed to be 
random because replications of tasks by the same subject are generally known to show variation in the time 
needed to complete them (Luce, 1986, sect. 1.2; Townsend & Ashby, 1983, chap. 3). The following 
decomposition for the (natural) logarithm of $T_{ij}$ is assumed as a model for its distribution: 

$$\ln T_{ij} = \mu + \delta_i + \tau_j + \varepsilon_{ij}, \tag{1}$$

with 

$$\varepsilon_{ij} \sim N(0, \sigma^2), \tag{2}$$

where $\mu$ is the grand mean or general response-time level for the item pool and population of test takers, 
$\tau_j$ is an effect parameter for the slowness of test taker j, $\delta_i$ is an effect parameter for the amount of 
time demanded by item i, and $\varepsilon_{ij}$ is a normally distributed residual or interaction term. Together, 
Equations 1-2 imply a lognormal distribution for the observed response times of a fixed test taker taking a 
fixed item. The model was proposed in Scrams and Schnipke (in preparation). The effect terms $\tau_j$ and 
$\delta_i$ are defined to have expectations equal to zero across test takers and items, respectively. The marginal 
distribution of log response time across test takers for a fixed item also depends on the distribution of 
$\tau_j$. This distribution is examined as one test of the goodness of fit of the model later in this paper. 

Observe that the distributions in Equations 1-2 vary in location across test takers and items but have a 
common variance. The last assumption is stringent but allows us to estimate the parameters in the model in 
a straightforward way. In addition, since the model will be used to constrain item selection using only a 
percentile in the upper tail of the distributions toward the end of the test, a slight misfit of the model would 
not seem to lead to serious item selection errors. However, whenever using the model, it should be standard 
practice to check its assumptions. 
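
As a rough illustration of how the parameters of the additive model for log response times can be estimated by averaging, the following Python sketch simulates data and recovers the grand mean, the item and person effects, and the residual variance from sample means. The function name, the simulated parameter values, and the assumption of a complete persons-by-items matrix are illustrative choices made here; operational adaptive data have many missing entries and would need a more careful treatment.

```python
import random

def estimate_rt_model(log_times):
    """Moment estimates for ln T_ij = mu + delta_i + tau_j + eps_ij.

    log_times[j][i] is the log response time of test taker j on item i.
    Assumes a complete persons-by-items matrix (an illustrative simplification).
    """
    n_persons, n_items = len(log_times), len(log_times[0])
    mu = sum(map(sum, log_times)) / (n_persons * n_items)
    # Item effect: mean over test takers minus the grand mean.
    delta = [sum(row[i] for row in log_times) / n_persons - mu
             for i in range(n_items)]
    # Person effect: mean over items minus the grand mean.
    tau = [sum(row) / n_items - mu for row in log_times]
    # Residual variance: mean squared residual after removing all effects.
    sigma2 = sum((log_times[j][i] - mu - delta[i] - tau[j]) ** 2
                 for j in range(n_persons)
                 for i in range(n_items)) / (n_persons * n_items)
    return mu, delta, tau, sigma2

# Simulated data with made-up parameter values (not the data from this paper).
rng = random.Random(0)
true_delta = [(i - 9.5) * 0.05 for i in range(20)]     # item effects, sum to zero
true_tau = [rng.gauss(0.0, 0.3) for _ in range(200)]
mean_tau = sum(true_tau) / len(true_tau)
true_tau = [t - mean_tau for t in true_tau]            # person effects, centered
data = [[4.0 + d + t + rng.gauss(0.0, 0.5) for d in true_delta]
        for t in true_tau]
mu_hat, delta_hat, tau_hat, sigma2_hat = estimate_rt_model(data)
```

Because the person and item means are removed before the residuals are squared, this simple estimate of the residual variance is slightly too small in finite samples; with large samples the effect is minor.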

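For future reference, note that 

$$\mu \equiv E_{ij}(\ln T_{ij}), \tag{3}$$

$$\delta_i \equiv E_j(\ln T_{ij}) - \mu, \tag{4}$$

$$\tau_j \equiv E_i(\ln T_{ij}) - \mu, \tag{5}$$

$$\sigma^2 \equiv E_{ij}[\ln T_{ij} - \mu - \delta_i - \tau_j]^2. \tag{6}$$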

Throughout this paper subscripts at expectation signs denote indices over which expectations are taken. 
A different use of the lognormal distribution as a model for response times is made in Thissen's (1983) model 
for timed testing. In his model, the lognormal distribution is parameterized to be dependent on the latent 
ability measured by the items and becomes part of the likelihood function used for estimating the test taker's 
ability from the joint distribution of the item scores and the response times. Thissen also found adequate fit 
for a series of tests, except for one which showed an overrepresentation of fast responses due to the relative 
easiness of its items. Other distributions used to study response times on test items are the Weibull (Roskam, 
1997) and the gamma distribution (Verhelst, Verstralen, & Jansen, 1997). In a previous study, the lognormal 
distribution showed a good fit to the response time distributions on an item pool for an operational, 
large-scale CAT, outperforming the Weibull and gamma distributions (Schnipke & Scrams, 1997). These 
results will be further discussed below when an empirical example is presented. 

IRT Model 

It is assumed that the item pool has been calibrated using an IRT model. In the empirical example later 
in this paper, the item pool was calibrated using the 3-parameter logistic (3-PL) model. The model describes 
the probability of a correct response on item i as: 

$$p_i(\theta) \equiv \mathrm{Prob}\{U_i = 1 \mid \theta\} = c_i + (1 - c_i)\{1 + \exp[-a_i(\theta - b_i)]\}^{-1}, \tag{7}$$

where $\theta$ is the unknown ability of the test taker and $a_i \in [0, \infty)$, $b_i \in (-\infty, \infty)$, and $c_i \in [0, 1]$ are the 
discrimination, difficulty, and guessing parameters for item i, respectively (Lord, 1980, chap. 2). 

A key quantity in IRT is Fisher's information on the unknown ability parameter. For a test of n items, the 
measure is defined as: 

$$I_{U_1, \ldots, U_n}(\theta) \equiv -E\left[\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid U_1, \ldots, U_n)\right]. \tag{8}$$

In Equation 8, $L(\theta \mid U_1, \ldots, U_n)$ is the likelihood statistic associated with the (random) response vector 
$U_1, \ldots, U_n$. For the 3-PL model in Equation 7 it holds that 

$$I_{U_1, \ldots, U_n}(\theta) = \sum_{i=1}^{n} \frac{[p_i'(\theta)]^2}{p_i(\theta)[1 - p_i(\theta)]}, \tag{9}$$

with 

$$p_i'(\theta) \equiv \frac{\partial}{\partial\theta} p_i(\theta) \tag{10}$$

(Lord, 1980, chap. 5). The term in Equation 9 for item i will be denoted as $I_i(\theta)$ and is the item information 
used in the maximum-information item selection criterion in the CAT algorithm below. 

Response-Time Constraints in CAT 

It is assumed that the item pool consists of items indexed by i = 1, ..., I. In addition, the CAT is assumed 
to consist of items indexed by k = 1, ..., n. Thus, index $i_k$ represents the event of the ith item in the pool being 
administered as the kth item in the CAT. The index values of the first k-1 items in the CAT are denoted by 
the set $S_{k-1} \equiv \{i_1, \ldots, i_{k-1}\}$. The remaining items in the pool are denoted by the set 
$R_k \equiv \{1, \ldots, I\} \setminus S_{k-1}$. The kth item in the test is chosen from the set $R_k$. 

The basic idea is to update an estimate of the test taker's slowness parameter $\tau_j$ in Equations 1-2 during 
the test given accurate estimates of $\mu$, item parameters $\delta_i$, i = 1, ..., I, and the residual variance, $\sigma^2$. The 
improved estimates of $\tau_j$ are used to update projections of the time needed to complete each of the remaining 
items in the pool. The next item is then selected subject to a constraint based on these projections as well as 
the time available to complete the remaining portion of the test. A Bayesian framework is used to update the 
response-time projections, whereas the response-time constraints are incorporated in the item selection 
procedure using a 0-1 linear programming (LP) model for constrained CAT that maximizes the information 
on $\theta$ in the test and also allows for additional constraints that can be used to guarantee its content validity. 

Updating Response-Time Estimates 

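It is assumed that $\delta_i$, i = 1, ..., I, and $\sigma^2$ have been estimated precisely enough to be considered as known. 
Estimates can easily be obtained from the response times in the calibration sample using Equations 
4 and 6. However, if test taker j is tested, $\tau_j$ is an unknown parameter; it is assumed to have a normal prior 
distribution: 

$$\tau_j \sim N(\mu_0, \sigma_0^2). \tag{11}$$

The model in Equations 1-2 yields a normal likelihood with unknown mean and known variance that 
has the normal distribution as its family of conjugate priors (Gelman, Carlin, Stern, & Rubin, 1995, sect. 2.6). 
Hence, noting $x_{i_p j} \equiv \ln(t_{i_p j}) - \mu - \delta_{i_p}$, the posterior distribution of $\tau_j$ after the response times on items 
$i_1, \ldots, i_{k-1}$ have been recorded is normal with mean and variance: 

$$E(\tau_j \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \frac{\sigma^2 \mu_0 + \sigma_0^2 \sum_{p=1}^{k-1} x_{i_p j}}{\sigma^2 + (k-1)\sigma_0^2}, \tag{12}$$

$$\mathrm{Var}(\tau_j \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + (k-1)\sigma_0^2}. \tag{13}$$

Also, the predictive density for the logarithm of the response time of test taker j on item i after items 
$i_1, \ldots, i_{k-1}$ is normal with mean equal to $\mu + \delta_i$ plus the posterior mean of $\tau_j$ and variance equal to the 
sum of the residual variance and the posterior variance of $\tau_j$: 

$$E(\ln T_{ij} \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \mu + \delta_i + E(\tau_j \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}), \tag{14}$$

$$\mathrm{Var}(\ln T_{ij} \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \sigma^2 + \mathrm{Var}(\tau_j \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}). \tag{15}$$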
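
The conjugate normal update described above can be sketched in a few lines of Python. This is only an illustration under the stated model (normal prior on the slowness parameter, known residual variance); the function and argument names are chosen here and are not part of any existing software.

```python
def posterior_tau(x, sigma2, mu0, sigma02):
    """Posterior mean and variance of the slowness parameter tau.

    x       : residual log times ln(t) - mu - delta for the items taken so far
    sigma2  : known residual variance of the log response times
    mu0     : prior mean of tau
    sigma02 : prior variance of tau
    """
    denom = sigma2 + len(x) * sigma02
    mean = (sigma2 * mu0 + sigma02 * sum(x)) / denom
    var = sigma2 * sigma02 / denom
    return mean, var

def predictive_log_time(mu, delta_i, x, sigma2, mu0, sigma02):
    """Mean and variance of the predictive density of ln T for a new item."""
    post_mean, post_var = posterior_tau(x, sigma2, mu0, sigma02)
    # Residual variance plus remaining uncertainty about tau.
    return mu + delta_i + post_mean, sigma2 + post_var
```

With no items administered the posterior equals the prior, and each additional response time shrinks the posterior variance of the slowness parameter, so the predictive variance approaches the residual variance from above.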



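As the test takers are assumed to be exchangeable, an obvious choice for the parameters in this prior is to 
equate them to the mean and variance of the population of test takers: 

$$\mu_0 \equiv E_j(\tau_j) = E_j[E_i(\ln T_{ij}) - \mu] = 0, \tag{16}$$

$$\sigma_0^2 \equiv \mathrm{Var}_j(\tau_j) \equiv \sigma_\tau^2, \tag{17}$$

for all j. Using Equations 16-17, the mean and variance in Equations 14-15 specialize to: 

$$E(\ln T_{ij} \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \mu + \delta_i + \frac{\sum_{p=1}^{k-1} x_{i_p j}}{\sigma^2/\sigma_\tau^2 + k - 1}, \tag{18}$$

$$\mathrm{Var}(\ln T_{ij} \mid x_{i_1 j}, \ldots, x_{i_{k-1} j}) = \sigma^2 \, \frac{\sigma^2 + k\sigma_\tau^2}{\sigma^2 + (k-1)\sigma_\tau^2}. \tag{19}$$

Because $\delta_i$, i = 1, ..., I, and $\sigma^2$ are assumed to be estimated using Equations 4 and 6, respectively, the 
expressions in Equations 18 and 19 have only known constants and are easy to calculate. 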

Let $t_{ij}^{\alpha}$ be the $\alpha$th percentile in this posterior predictive density for $\ln T_{ij}$ transformed back to the 
original time scale. The choice of item will now be constrained using this percentile for all remaining items 
in the pool, $i \in R_k$. As will become clear later, it makes sense to choose a value for $\alpha$ near the middle of the 
density in the beginning of the test, for example, the expected value of the posterior predictive density in 
Equation 18, and move to percentiles in the upper tail toward the end of the test. 
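
Because the predictive density of the log response time is normal, a percentile on the original time scale is obtained by exponentiating the corresponding normal percentile. The sketch below assumes a prior for the slowness parameter with mean zero and variance `sigma_tau2`; the function name is chosen here, and the numerical values in the usage lines are only loosely patterned on the estimates reported in the empirical example later in this paper, with an item effect of zero picked for convenience.

```python
import math
from statistics import NormalDist

def predicted_time_percentile(alpha, mu, delta_i, x, sigma2, sigma_tau2):
    """alpha-quantile (0 < alpha < 1) of the predicted response time.

    x holds the residual log times ln(t) - mu - delta for the k - 1 items
    already administered; the prior on tau is N(0, sigma_tau2).
    """
    k_minus_1 = len(x)
    mean = mu + delta_i + sum(x) / (sigma2 / sigma_tau2 + k_minus_1)
    var = (sigma2 * (sigma2 + (k_minus_1 + 1) * sigma_tau2)
           / (sigma2 + k_minus_1 * sigma_tau2))
    z = NormalDist().inv_cdf(alpha)           # standard normal quantile
    return math.exp(mean + z * math.sqrt(var))

# Before any item is taken, for a hypothetical item with delta_i = 0.
t50 = predicted_time_percentile(0.50, 4.093, 0.0, [], 0.515 ** 2, 0.344 ** 2)
t95 = predicted_time_percentile(0.95, 4.093, 0.0, [], 0.515 ** 2, 0.344 ** 2)
```

The 50th percentile is the median of the predicted lognormal distribution, whereas the 95th percentile gives a much more conservative time projection for the same item.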

Constrained CAT Algorithm 

The kth item is selected according to an algorithm for constrained CAT presented in van der Linden and 
Reese (1998). To select the initial item, the algorithm first selects a full test that meets all constraints to be 
imposed on the selection of items in the CAT and has maximum information at the initial ability estimate. 
The item actually administered is the one from the assembled test with maximum information at this ability 
estimate. At each next step, the test is reassembled to have maximum information at the updated ability 
estimate fixing the items already administered. Again, the item to be administered is selected from the new 
portion of the test to have maximum information at the updated ability estimate. The procedure is repeated 
until the last item is selected. 

The fact that a full test is assembled at each step rather than a single item keeps the actual item selection 
feasible with respect to the set of constraints. Because both the test assembly and the selection of the 
individual item have the objective of maximum information, the ability estimator can be expected to be 
maximally informative too. All test assembly is done while the test taker takes the test and is based on a 0-1 
LP model that represents all test specifications. 

To discuss the model, decision variables $x_i$, i = 1, ..., I, are introduced that take the value 1 if item i is 
selected in the test and the value 0 otherwise. The total amount of time available for the CAT is denoted as 
$t_{tot}$. In addition, it is assumed that the composition of the test is constrained with respect to a variety of 
categorical attributes, such as content, cognitive level, and item format. These attributes partition the item 
pool into a collection of sets $V_g$, g = 1, ..., G, each of which is defined by one or more attribute values. Also, 
the composition of the test can be constrained with respect to several quantitative attributes h = 1, ..., H, 
such as word counts, exposure rates, and IRT parameter values. Finally, let $\hat{\theta}_{k-1}$ be the estimate of $\theta$ after k-1 
items have been administered. 

The decision variables are used to formulate the following linear model for selecting item k for the 
current test taker: 

$$\text{maximize} \quad \sum_{i=1}^{I} I_i(\hat{\theta}_{k-1})\, x_i \tag{20}$$

subject to 

$$\sum_{i \in S_{k-1}} t_{ij} x_i + \sum_{i \in R_k} t_{ij}^{\alpha} x_i \le t_{tot}, \tag{21}$$

$$\sum_{i \in S_{k-1}} x_i = k - 1, \tag{22}$$

$$\sum_{i=1}^{I} x_i = n, \tag{23}$$

$$\sum_{i \in V_g} x_i \ge n_g^{(l)}, \quad g = 1, \ldots, G, \tag{24}$$

$$\sum_{i \in V_g} x_i \le n_g^{(u)}, \quad g = 1, \ldots, G, \tag{25}$$

$$\sum_{i=1}^{I} q_{hi} x_i \ge q_h^{(l)}, \quad h = 1, \ldots, H, \tag{26}$$

$$\sum_{i=1}^{I} q_{hi} x_i \le q_h^{(u)}, \quad h = 1, \ldots, H, \tag{27}$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I. \tag{28}$$

The objective function in Equation 20 maximizes the information in the test at $\hat{\theta}_{k-1}$. The total length of the 
test is set at n items in Equation 23, whereas Equation 22 fixes the values of the decision variables of all items 
that have already been administered to the test taker at 1. The key constraint in this paper is the one on 
response times in Equation 21, which requires the remaining n-k+1 items to be selected such that the sum 
of the $\alpha$th percentiles of their predicted response-time distributions plus the actual response times on the first 
k-1 items is not larger than the total amount of time available. In Equations 24-25, the numbers of items with 
the various attribute categories are required to be between lower and upper bounds $n_g^{(l)}$ and $n_g^{(u)}$, respectively. 
Finally, the constraints in Equations 26-27 guarantee that the sums of the values $q_{hi}$ of the items on the various 
quantitative attributes are between the bounds $q_h^{(l)}$ and $q_h^{(u)}$. 

At step k, the model thus selects n-k+1 new items from the set $R_k$. The item actually administered is 
the one selected from this set that is most informative at $\hat{\theta}_{k-1}$. The cycle is then repeated to select item k+1. 
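
The selection step can be illustrated on a toy pool. The sketch below does not use a 0-1 LP solver; it replaces the solver with an exhaustive search over item combinations, which is only practical for very small pools, and it keeps only the test-length and response-time constraints (the content and quantitative attribute constraints are omitted). All function names and numbers are hypothetical.

```python
from itertools import combinations

def assemble_new_items(info, t_pred, administered, time_used, n, t_tot):
    """Choose the n - (k - 1) new items with maximum total information such
    that time already used plus predicted times does not exceed t_tot.

    info[i]   : item information at the current ability estimate
    t_pred[i] : chosen percentile of the predicted response time for item i
    Exhaustive search stands in for the 0-1 LP solver (toy pools only).
    """
    free = [i for i in range(len(info)) if i not in administered]
    n_new = n - len(administered)
    best, best_info = None, float("-inf")
    for combo in combinations(free, n_new):
        if time_used + sum(t_pred[i] for i in combo) > t_tot:
            continue                      # violates the response-time constraint
        total = sum(info[i] for i in combo)
        if total > best_info:
            best, best_info = combo, total
    return best                           # None if no feasible test exists

def select_next_item(info, t_pred, administered, time_used, n, t_tot):
    """Administer the most informative item among the newly assembled items."""
    new_items = assemble_new_items(info, t_pred, administered, time_used, n, t_tot)
    if new_items is None:
        return None
    return max(new_items, key=lambda i: info[i])

# Toy pool of five items; item 0 was administered and took 130 time units.
info = [1.0, 0.8, 0.6, 0.4, 0.3]
t_pred = [120, 100, 40, 30, 20]
constrained = select_next_item(info, t_pred, [0], 130, 3, 180)
unconstrained = select_next_item(info, t_pred, [0], 130, 3, 10 ** 9)
```

In this toy case only 50 time units remain for two items, so the search passes over the more informative but time-consuming items 1 and 2 and assembles items 3 and 4 instead; without the time limit, item 1 would be administered next.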

As already noted, it makes sense to choose the percentiles close to the means of the posterior predictive 
response-time densities in the beginning of the test but to move toward their upper tails later on. This 
suggestion is motivated by the fact that the sum of the mean predicted response times is a good predictor of 
the actual time for a large set of items but a more conservative predictor is needed if the set becomes smaller. 

Other types of constraints can be added to the model to deal with possibly remaining test specifications. 
Several examples of possible additions are given in van der Linden (1998) and van der Linden and Reese 
(1998). The set of specifications should be large enough to guarantee a CAT with satisfactory content validity. 

Empirical Example 

A previous pool from an operational, large-scale CAT was used to study the behavior of the algorithm. 
The pool consisted of 186 items used for the adaptive version of an arithmetic reasoning test. The length of 
the test was 15 items. The items in the pool were calibrated using the 3-PL model in Equation 7. 

Response-time data were recorded for 38,357 test takers who took the test in 1997. 

The parameters in the response-time model in Equations 1-2 were estimated by substituting sample 
statistics into Equations 3-6. The following results were obtained: $\hat{\mu} = 4.093$ and $\hat{\sigma} = .515$, whereas the estimated 
item and person effects, $\hat{\delta}_i$ and $\hat{\tau}_j$, were distributed about zero with standard deviations equal to .497 and 
.344, respectively. The correlation between $\hat{\theta}$ and $\hat{\tau}$ was equal to .035, indicating that ability and speed were 
independent variables for this test. 

Note that these standard deviations overestimate the standard deviations of the distributions of the true 
effects. However, because the sample of test takers is large, the bias in $\hat{\sigma}_\delta$ is negligible. Moreover, this estimate 
is only reported here as a descriptive statistic; it does not play any role in the adaptive procedure in this example. 
The bias in $\hat{\sigma}_\tau^2$ is expected to be larger, but this quantity serves as the variance of the prior for $\tau_j$ in the 
adaptive procedure. The result is thus a less informative prior, and hence a more conservative adaptive test. 
Also, note that these parameters were estimated ignoring the missing entries in the data matrix. This procedure 
is not assumed to yield biased estimates since ability and speed were estimated to be independent variables. 

The main goal of the study was to estimate the effect of the response-time constraints in Equation 21 on 
the statistical properties of the ability estimator. In particular, the bias and root mean squared error (RMSE) 
functions of the estimator were studied for various values of $\tau$. 

The ability values of the simulated test takers were chosen to be equal to $\theta$ = -1.5, -0.5, 0.5, and 1.5. In 
addition, the response times were simulated at the values of $\tau$ = -.60, -.30, .00, .30, and .60. Recall that the 
values of $\tau$ in the empirical example of test takers were estimated to be distributed about .00 with a standard 
deviation equal to .515. The simulated $\tau$ values thus cover the range of values in the sample of test takers. 

The number of replications for each combination of the $\theta$ and $\tau$ values was equal to 120. The ability estimator 
used was the expected value of the posterior distribution of the ability parameter (EAP estimator) with a uniform prior 
distribution. The first item was selected to be optimal at $\theta$ = 0; this selection is known to introduce a more 
favorable RMSE at this ability value and more unfavorable RMSEs toward the ends of the ability scale. The 
LP model was solved using the First Acceptable Integer Solution Algorithm from the ConTEST test assembly 
software package; a detailed description of the algorithm is available in the manual (Timminga, van der 
Linden, & Schweizer, 1996, sect. 6.6). On a PC with a Pentium/133 MHz processor the time needed to update 
the ability estimate, solve the LP model, and select the most informative item was always less than 1 second. 
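
An EAP estimate with a uniform prior can be approximated by averaging ability values over a grid, weighted by the likelihood of the observed responses. The sketch below is a minimal illustration under the 3-PL model; the range of the uniform prior (-4 to 4) and the number of grid points are assumptions made here and are not taken from the study.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, lo=-4.0, hi=4.0, n_points=81):
    """Posterior mean of theta under a uniform prior on [lo, hi].

    responses : list of 0/1 item scores
    items     : list of (a, b, c) parameter triples, one per response
    """
    grid = [lo + (hi - lo) * q / (n_points - 1) for q in range(n_points)]
    weights = []
    for theta in grid:
        like = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            like *= p if u == 1 else 1.0 - p
        weights.append(like)              # uniform prior: weight = likelihood
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

Before any response is observed the estimate equals the midpoint of the prior range, which is one reason why the choice of the first item and of the prior range matters most for short tests.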

The values of the parameter $t_{ij}^{\alpha}$ in Equation 21 were set equal to the 50th percentile in the posterior 
predictive distributions for the selection of item k = 1 and moved to the 95th percentile for the selection of 
item k = 13 in equal steps. The last value was maintained for items k = 14 and 15. The total amount of time 
available, $t_{tot}$, was set equal to 39 minutes. The same time limit was used in the actual CAT. 
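
The percentile schedule just described amounts to twelve equal steps of 3.75 percentile points between the first and the thirteenth item. A minimal sketch, with a function name and default arguments chosen here for illustration:

```python
def percentile_schedule(n_items=15, start=50.0, end=95.0, last_step=13):
    """Percentile used for selecting item k = 1, ..., n_items.

    Moves from `start` at k = 1 to `end` at k = last_step in equal steps
    and keeps `end` for all later items.
    """
    step = (end - start) / (last_step - 1)
    return [min(start + (k - 1) * step, end) for k in range(1, n_items + 1)]

schedule = percentile_schedule()
```

The resulting list starts at 50.0, increases by 3.75 per item, reaches 95.0 at the thirteenth item, and stays there for the last two items.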

Fit of Response-Time Distribution 

The response-time distribution in Equations 1-2 was tested for its assumption of a lognormal shape 
against the assumptions of a normal, gamma, and Weibull distribution for the response times. Detailed 
results for the current data set are given in Schnipke and Scrams (1997). Since only single observations were 
recorded for each item and test taker combination, the distributions of the log response times on the items 
were inspected pooling the data across the test takers. The distribution of the estimated test taker parameters 
was approximately normal. Hence, the marginal distribution of the log response times for each item is also 
normal. This feature was checked in depth for 30 of the 186 items. These items were selected not to involve 
any figures and to be answered by at least 1,000 test takers. The samples were randomly split into halves 
used for parameter estimation and checking the distributional assumptions. The parameters of the four 
candidate models for the response-time distributions were estimated using maximum likelihood (ML) estimation. 

Double probability plots were produced for each model separately for each item, with the observed 
cumulative probability function along the abscissa and the estimated function along the ordinate. A typical 
example is given in Figure 1. The lognormal distribution provided the best fit (indicated by most points 
falling along the diagonal), followed by the gamma, Weibull, and normal. The same essential result was 
found for all other items. 

FIGURE 1. Double probability plot of the four response-time distributions for a typical item 

FIGURE 2. RMSE of the items for the four response-time distributions 

Fits were also examined using the RMSE calculated between the observed and estimated distribution 
functions at each fifth percentile. The results for all 30 items are provided in Figure 2. For readability, items 
were ranked according to the quality of fit provided by the lognormal distribution. Again, the lognormal 
distribution provided the best overall fit, and the other three distributions were ranked as before. 
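
A fit index of this kind can be sketched as follows: fit a distribution by ML, evaluate its distribution function at the empirical 5th, 10th, ..., 95th percentiles, and take the RMSE of the differences from the nominal probabilities. The details below (a crude order-statistic percentile, simulated lognormal data, and a comparison of only the lognormal and normal models) are assumptions made for illustration and do not reproduce the analysis in Schnipke and Scrams (1997).

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def cdf_rmse(times, cdf):
    """RMSE between nominal and fitted probabilities at every fifth percentile."""
    ordered = sorted(times)
    n = len(ordered)
    sq_err = []
    for pct in range(5, 100, 5):
        p_obs = pct / 100.0
        t_q = ordered[min(n - 1, int(p_obs * n))]   # crude empirical percentile
        sq_err.append((cdf(t_q) - p_obs) ** 2)
    return math.sqrt(fmean(sq_err))

# Simulated response times (made-up parameters, not the operational data).
rng = random.Random(1)
times = [rng.lognormvariate(4.0, 0.6) for _ in range(2000)]

# ML estimates: mean and (population) SD of the log times for the lognormal,
# and of the raw times for the normal.
logs = [math.log(t) for t in times]
lognormal = NormalDist(fmean(logs), pstdev(logs))
normal = NormalDist(fmean(times), pstdev(times))

rmse_lognormal = cdf_rmse(times, lambda t: lognormal.cdf(math.log(t)))
rmse_normal = cdf_rmse(times, normal.cdf)
```

For data that really are lognormal, the RMSE for the lognormal fit is close to zero, whereas the symmetric normal model misses both the lower tail and the median of the skewed distribution.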

The assumption of the common standard deviation across items was tested by plotting the sample estimates 
of the standard deviations of the log response times for each of the 186 items against sample sizes in Figure 
3. Different items were administered different numbers of times. For smaller samples much of the observed 
variability may be attributed to sampling variation. However, for larger samples, the estimated standard 
deviations should stabilize about an identifiable mean. Figure 3 shows that the estimates indeed stabilize about 
the value of 0.63 estimated for the common standard deviation. 




FIGURE 3. Standard deviation of the log response times for the items 
as a function of sample size 



RMSE and Bias Function of Ability Estimator 

The RMSE and bias functions of the ability estimator in the CAT algorithm were estimated as 

$$\left\{\sum_{r=1}^{120} \left[(\hat{\theta}_r - \theta)^2 \mid \theta\right] / 120\right\}^{1/2} \quad \text{and} \quad \sum_{r=1}^{120} \left[\hat{\theta}_r - \theta \mid \theta\right] / 120,$$

respectively, where the sums run over the 120 replications at each value of $\theta$. The results for $\tau$ = -.60, -.30, .00, 
.30, and .60 are shown in Figure 4 (n = 15). Both for the bias and RMSE function, no systematic differences 
between the curves were obtained. Also, the results are of the same order as those for unconstrained adaptive 
testing with a variety of item-selection criteria in van der Linden (1998). The only conspicuous feature is the 
tendency of a slight positive bias, and hence a slightly larger RMSE, at $\theta$ = -1.50. However, since the tendency 
is the same for all five $\tau$ values, the result is believed to be an artifact due to the composition of the item pool. 
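
For a single true ability value, the bias and RMSE are computed from the replicated ability estimates as in the following minimal sketch (the function name and the example numbers are hypothetical):

```python
import math

def bias_and_rmse(estimates, true_theta):
    """Bias and root mean squared error of replicated ability estimates."""
    errors = [est - true_theta for est in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

# Two hypothetical replications at a true ability of 0.5.
example_bias, example_rmse = bias_and_rmse([0.4, 0.6], 0.5)
```

In this small example the two errors cancel, so the bias is zero up to rounding while the RMSE equals 0.1; repeating the computation at each true ability value traces out the bias and RMSE functions.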

FIGURE 4. Estimated RMSE and bias functions of the ability 
estimator after 15 items 



Conclusion 

Differential speededness among test takers is a problem in educational measurement. In standardized 
testing, solutions to the problem include relaxing the time limit or reducing the number of items 
administered. However, these solutions either result in greater administration costs or reduced measurement 
precision for all test takers. The solution proposed in this paper is to make the test adaptive and constrain 
item selection according to the time needed by the test takers. Response-time constraints on item selection 
are built into an adaptive procedure that maximizes the statistical information in the test about the test 
taker's ability. The model for the response-time distributions used is the lognormal. The data in the empirical 
example showed a satisfactory fit to the model. Also, for a variety of 0 and t values, the bias and the RMSE 
of the ability estimator in this example did not show any anomalies due to the presence of the response-time 
constraint in the item-selection procedure. 